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Abstract 

Data which lie in the space V m , of m x m symmetric positive definite matrices, (sometimes called tensor data), play a 
fundamental role in applications including medical imaging, computer vision, and radar signal processing. An open challenge, for 
these applications, is to find a class of probability distributions, which is able to capture the statistical properties of data in V m , as 
they arise in real-world situations. The present paper meets this challenge by introducing Riemannian Gaussian distributions on 
Vm • Distributions of this kind were first considered by Pennec in 2006. However, the present paper gives an exact expression of 
their probability density function for the first time in existing literature. This leads to two original contributions. First, a detailed 
study of statistical inference for Riemannian Gaussian distributions, uncovering the connection between maximum likelihood 
estimation and the concept of Riemannian centre of mass, widely used in applications. Second, the derivation and implementation 
of an expectation-maximisation algorithm, for the estimation of mixtures of Riemannian Gaussian distributions. The paper applies 
this new algorithm, to the classification of data in V m , (concretely, to the problem of texture classification, in computer vision), 
showing that it yields significantly better performance, in comparison to recent approaches. 

Index Terms 

Symmetric positive definite matrices, tensor, Riemannian metric, Gaussian distribution, expectation-maximisation, texture 


I. Introduction 


It has been known for quite some time, in fields ranging from multivariate statistics and information geometry 00- to 
matrix analysis 0, harmonic analysis and number theory 00, that the space V m , of m x m symmetric positive definite 
matrices, can be equipped with a Riemannian metric, which gives it the structure of a Riemannian homogeneous space of 
negative curvature. In the present paper, this Riemannian metric is called the Rao-Fisher metric, (another popular name is 
affine-invariant metric), and is the subject of Section [II] 

During the past ten years, in response to the need for effective methods of processing data which lie in the space V m , 
the Rao-Fisher metric, (in addition to an alternative so-called log-Euclidean metric ©)- has been widely used in engineering 
applications, which include medical imaging |7||8|, continuum mechanics |j9|], radar signal processing G9GD- and computer 
vision G3-G3- These applications have evolved useful techniques for analysis, representation and classification of data which 
lie in the space V m ■ An introduction to these techniques may be found in ]T7) . 

However, the literature still lacks a probabilistic model, rigorously defined yet tractable, which is able to represent the 
statistical variability of data in V m , and to bring the machinery of statistical inference, (estimation and hypothesis testing), to 
bear on such data. The present paper develops such a probabilistic model, by introducing a new class of probability distributions 
on the space V m , called Riemannian Gaussian distributions. These distributions are the subject of Section [HI] below. 

A Riemannian Gaussian distribution, denoted G(Y,o), depends on two parameters, Y £ V m and a > 0. The expression of 
its probability density function, generalising that of a Gaussian distribution on the Euclidean space R p , is given by 


P{Y\Y,(j) 
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(1) 


Here, d : V m x V n 


is Rao’s Riemannian distance, and the density is with respect to the Riemannian volume element 


of V m , which is henceforth denoted dv(Y ), (for required mathematical definitions, see Paragraph H-Ai. In comparison to a 
Gaussian distribution on R, the normalising factor £(cx) plays the role of the factor \/ 2tt a 2 . 

It will be seen that Riemannian Gaussian distributions provide a statistical foundation for the concept of Riemannian centre 
of mass, (also called geometric mean, Riemannian mean, or Frechet mean), currently essential to many applications [T8||l9). 
Recall the Riemannian centre of mass Y N , of data points Yj,..., Y N in V r „ , is the unique global minimiser of S N : V m —> R+, 


= E d '- 

n—1 


(Y,Y n ) 


( 2 ) 


where, again, d : V m x V m R_|_ is Rao’s Riemannian distance. Since Y N minimises the sum of squares of distances to the 
points Yi,.... y v , it is widely viewed as a representative, average, or mode of these points. 

Distributions of the form 0 were considered by Pennec, who defined them on general Riemannian manifolds |20) . However, 
in existing literature, their treatment remains incomplete, as it is based on asymptotic formulae, valid only in the limit where 
the parameter a is small, see p0|-p2]. In addition to being inexact, such formulae are quite difficult, both to evaluate and to 
apply. These issues, (lack of an exact expression and difficulty of application), are overcome in the following. 
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Indeed, Section [TIT] opens with Proposition |4j which gives an exact expression of the normalising factor £(cx), appearing in 
([lj. This confirms the important property that this factor does not depend on the parameter Y. Proposition [ 4 ] is followed by 
Propositions [5] and [6] which together define a general method for sampling from a given Riemannian Gaussian distribution. 
These three propositions fully characterise Riemannian Gaussian distributions both theoretically and in view of computer 
simulations. 

Propositions [7] and [9] describe the relationship between Riemannian Gaussian distributions and the concept of Riemannian 
centre of mass. Proposition [7] states that the maximum likelihood estimate of the parameter Y of a distribution C{ Y. a), based 
on samples Yi, ... ,Y N from this distribution, is equal to the Riemannian centre of mass Y N , global minimiser of (|2j. In short, 
for Riemannian Gaussian distributions, maximum likelihood is equivalent to Riemannian centre of mass. 

Proposition [9] is the asymptotic counterpart of Proposition [7] corresponding to the limit N —> oo . It states that the parameter 
Y of a distribution G(Y. a) is the Riemannian centre of mass of this distribution, (see [ 181, for the concept of Riemannian 
centre of mass of a probability distribution). This means that Y is the unique global minimiser of £ : V m —> R+, 


£(Y)= d 2 (Y 7 Z)p(Z\ Y,a) dv(Z) (3) 

Note that, by the law of large numbers, £ N {Y) —f £(Y) as N —> oo, so ([3|> is the asymptotic counterpart of (j2j). Propositions 

|7] and [9] allow for the concept of Riemannian centre of mass to be studied within the framework of parametric statistical 
inference. Compare to p3|p4), which have a non-parametric approach. 

This is the object of Proposition [8j which deals with hypothesis testing, (testing against a given value of Y), and of 

Propositions 10 and [TT[ which establish that Y N is asymptotically normally distributed about Y, in the limit N —> oo, and 


give its asymptotic covariance matrix, (this makes it possible to construct asymptotic confidence regions for F). These results 
are intended to assist users wishing to assign a statistical significance to the Riemannian centre of mass Y N , computed from 
data points Yj..... Y N . 

Before going on, note that, in Section 
order to distinguish it from Y. Precisely, 


III 


Y N is called by the different name of empirical Riemannian centre of mass, in 
Y n is an estimate, while Y is a parameter. 

The use of Riemannian Gaussian distributions in the representation and classification of data in V m is considered in Section 


IV 


This section is motivated by the idea that the class of mixtures of Riemannian Gaussian distributions is expected to 
be sufficiently rich, in order to represent the statistical distribution of data in V m which arise in real-world applications, 
(experimental verification of this idea is still ongoing g5)). A mixture of Riemannian Gaussian distributions is a probability 
distribution on V m which has probability density function. 


P( Y ) = 

M= 1 


x p(Y\ Yfj,, (jp) 


(4) 


■ ZUn 


= 1, and where each density p(Y\ Yf 


where zui,, w M are non-zero positive weights which satisfy vo\ + . 
given by (|Tj», with Y^ £ V m and > 0. 

The estimation of mixtures of Riemannian Gaussian distributions is treated in Paragraph IV-A which derives a new EM 
(expectation-maximisation) algorithm for computing maximum likelihood estimates of mixture parameters d = {( zu^,, Y],, cr ;l ); 
H = 1 This algorithm appears as a generalisation, to the context of the Riemannian geometry of the space V m , 

of EM algorithms currently used for estimation of mixtures of parametrised probability distributions on a Euclidean space. 



fact that the space V m , equipped with the Rao-Fisher metric, is a Riemannian homogeneous space of negative curvature. In 
particular, these results have no meaningful counterpart using alternative Riemannian metrics, such as the above-mentioned 
log-Euclidean metric. 

On the other hand, they generalise immediately to any Riemanniann homogeneous space of negative curvature. Spaces of this 
kind include the space of univariate normal distributions (37), the space of Toeplitz autocovariance matrices and the space 
of Block-Toeplitz autocovariance matrices which have Toeplitz blocks |32j[33), with further examples given in |34j|35). The 
definition and properties of Riemannian Gaussian distributions can be generalised to any of these spaces. This generalisation 
is the focus of ongoing work. 
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To place the present paper in this general context, consider briefly the definition of Riemannian Gaussian distributions on 
an arbitrary Riemannian homogeneous space of negative curvature, Ad. This gives an abstract picture of Riemannian Gaussian 
distributions, which will be made concrete as of Section [II] by putting Ad = V m ■ The definition is based on three essential 
ingredients: the existence of an invariant distance, the existence of invariant integrals, and the existence and uniqueness of 
Riemannian centres of mass. In the special case At = V m , these will be formulated precisely in Propositions [T] [2] and [3] of 
Section [TT] 

The space Ai is equipped with a Riemannian metric, which defines a Riemannian distance d : At x At —> R + . The distance 
between two points x, y £ At is then d{x,y). To say that Ad is a Riemannian homogeneous space means two things Si- 
First, a group of transformations G acts transitively on Ad. That is, each element g £ G defines a transformation of A i, 
mapping each point x £ Ad to its image x ■ g. Moreover, for any two points x,y £ Ad there exists g £ G such that y = x ■ g. 
Second, the Riemannian distance is invariant under the group of transformations G, 


invariant distance : d{x, y) = d{x • g,y ■ g) x, y £ Ad , g £ G 

The notation x • g stands for right action of G on Ad. That is, x ■ (c/iffc) = (ir ■ gf) • g 2 , for g 1 , g 2 £ G. The choice of right action 
over left action, (which would be denoted g ■ x, instead of x ■ g), is a matter of convention, which is here adopted throughout 
the paper. 

A corollary of the invariance of Riemannian distance is that it is possible to define invariant integrals. Precisely, the 
Riemannian metric of Ai defines a Riemannian volume element, dv{x) |5|. Integrals with respect to this volume element 
are invariant under the group of transformations G, 


invariant integrals : 


f(x)dv(x) = f(x- g)dv(x) g£G 


This property means that the integral of any function / : Ad —>■ R is equal to the integral of this function composed with a 
transformation x \—tx-g. Intuitively, it is a direct result of the invariance of distance, through the basic relationship between 
the concepts of length and volume. 

In the special case Ai = V m , the existence of invariant distance and of invariant integrals are formulated in equations 
(17b I and (|18a|) below. In all generality, these two properties supply the expression of the probability density function of a 


Riemannian Gaussian distribution on At, 

Riemannian Gaussian p.d.f. : 


p(x\x,Cr) = 


1 


CW 


exp 


d 2 (x, x) 
2 ^~ 


where x £ Ad and a > 0 are parameters, and the density is with respect to the Riemannian volume element d.v(x). 

The crucial feature of this expression of the probability density is that the normalising factor £(<r) does not depend on the 
parameter x, but only on the parameter er. This follows from the existence of invariant distance and of invariant integrals, as 
can be seen by repeating word for word the proof of item (i) of Proposition [4] Section III 


This feature implies the special relationship between Riemannian Gaussian distributions and Riemannian centres of mass. 
Namely, for these distributions, maximum likelihood is equivalent to Riemannian centre of mass. Indeed, if ..... x. v are 
samples from the above probability density, the resulting log-likelihood function is 


log-likelihood : 


N 

E 

n — 1 


1 x > 

log p(x n \x,o) = -N\og((o) -- d 2 (x n ,x) 


Since the first term on the right hand side does not depend on x, it follows that the maximum likelihood estimate x N of the 
parameter x should minimise the sum of squares of distances to the samples x n . In other words, it should be a Riemannian 
centre of mass of these samples, (compare to formula |[2}, above). In the special case At = V m , this reasoning will appear in 
the proof of Proposition [7] Section [ill] 

At this point, the fact that Ad has negative curvature becomes important. It is this fact that guarantees the existence and 
uniqueness of Riemannian centres of mass in Ad |18|p9|, so that the maximum likelihood estimate x N of the parameter x is 
well-defined and unique. In the special case Ad = V m , this is stated in Proposition [3] of Section [TT] 

With regard to the maximum likelihood estimate d N of the parameter o, a more detailed knowledge of the space Ai is 
required. Indeed, to establish existence and uniqueness of d N , it is necessary to show that log ((a) is a strictly convex function 
of the “natural parameter” r/ = —1/a 2 . In the special case Ad = V m , this follows from item (ii) of Proposition [d] as stated 
in the proof of Proposition [7] 

To obtain the same result for a general Riemannian homogeneous space of negative curvature, Ai, it is necessary to 
decompose Ad into a direct product of symmetric spaces of Euclidean or of non-compact type, and consequently use integration 
formulas on these Symmetric spaces, which may be found in |4). In a future submission, this will be carried out for all above 
mentioned examples, (e.g., spaces of autocovariance matrices with Toeplitz structure |10}, or Block-Toeplitz structure with 
Toeplitz blocks |32|j33|). However, the present paper will exclusively focus on the special case Ai = V m , since it is relevant 
to a wider range of applications, and requires less mathematical background. 
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II. Riemannian geometry of covariance matrices 
Let Vm denote the space of all to x to real matrices Y which are symmetric and strictly positive definite, 

Y f -Y = 0 x f Yx > 0 for all ieK™ (5) 

where f denotes the transpose. The present section is concerned with the Riemannian geometry of the space V m , when this 
space is equipped with the Rao-Fisher metric 00- 

Paragraph |II-A| gives the expressions of the Rao-Fisher metric, and of the related Riemannian distance (Rao’s distance), 
volume element, and geodesic curves. 

Paragraph |II-B| states the fundamental geometric properties of the space V m , in view of the development of following sections. 
Essentially, these properties reflect the fact that V m , (equipped with the Rao-Fisher metric), is a Riemannian homogeneous 
space of negative curvature. 


A. The Rao-Fisher metric: distance, geodesics and volume 

A Riemannian metric on V m is a quadratic form ds 2 (Y) which measures the squared length of a small displacement dY , 
separating two elements Y £ V m and Y + dY £ V m ■ Here, dY is a symmetric matrix, since Y and Y + dY are symmetric, 
by 0. The Rao-Fisher metric is the following @0- 

ds 2 (Y) =tr[F- 1 dY] 2 (6) 


where tr denotes the trace. When there is no loss of clarity, this is written ds 2 {Y) = ||dF|| 2 . 

The Rao-Fisher metric, like any other Riemannian metric on P m , defines a Riemannian distance d : V m x V m —> R_. This 
is called Rao’s distance, and is defined as follows @@ Let Y, Z £ Vm and c : [0,1] — > V m be a differentiable curve with 
c(0) = Y and c(l) = Z. The length L(c) of c is defined by 

L(c) = f ds(c(t)) = f ||c(i)|| dt (7) 

Jo Jo 

where c(f) = . Rao’s distance d(Y, Z) is the infimum of L(c ) taken over all differentiable curves c as above. 

When equipped with the Rao-Fisher metric, the space V rn is a Riemannian manifold of negative sectional curvature 00 . 
One implication of this property, (since V m is also complete and simply connected), is that the infimum of L(c) is realised 
by a unique curve 7 , known as the geodesic connecting Y and Z. The equation of this curve is the following, 

7 (f) = Y 1/2 (Y- 1/2 ZY~ 1/2 y Y 1/2 for t £ [0,1] ( 8 ) 

Given expression 0 , it is possible to compute L( 7 ) from 0. This is precisely Rao’s distance d(Y, Z). It turns out 0 , 

d 2 (Y, Z) = tr [log {Y- 1 / 2 ZY- 1 / 2 )} 2 (9) 

All matrix functions appearing in 0 and 0, (square root, elevation to the power t, and logarithm), should be understood as 
symmetric matrix functions. For this, see [| 36| . 

Since the Rao-Fisher metric gives a means of measuring length, it can also be used to measure volume. Roughly, this is 
based on the elementary fact that “the volume of a cube is the product of the lengths of its sides.” The Riemannian volume 
element associated to the Rao-Fisher metric is the following (5j, 

dv(Y) = det(Y) 0 dY t j (10) 

i<j 


where indices denote matrix entries. 

To close this paragraph, consider the expressions of ds 2 (Y) and dv(Y), given by 0 and (10 1 , in terms of polar coordinates. 
The use of polar coordinates refers to the parametrisation of Y £ V m by its eigenvalues and eigenvectors. For (ri,..., r m ) £ R m 
and U £ O (m), where O(to) is the group of to x to real orthogonal matrices, let diag(e r ) be the diagonal matrix with main 


diagonal 


and 


Y(r, U) = U f diag(e r ) U 


( 11 ) 


Then, it is clear that Y(r, U) belongs to V m ■ Conversely, given any Y £ V rn , by writing down the spectral decomposition of 
Y, it is possible to find (r, U ) such that Y = Y{r , U ). 

Expressed in polar coordinates, ds 2 {Y) and dv(Y) appear as follows, 

m / _ \ 

ds 2 {Y) = ]T dr 2 + 8 £ sinh 2 ( (12) 

j= 1 i<3 ^ 2 

dv{Y) = 8 ~^ * det(0) 0 sinh 0 drj (13) 

i<j ' 2 ' i=i 
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where Oij = Ujk dUik , and the notation det(0) is used for the exterior product, 

det(0) = f\ Oij (14) 

i<j 

Under a slightly different form, these expressions are given in |5j (Exercise 25, Page 24). 

When using expression ( p~3|) to compute integrals with respect to the volume element dv(Y ), one should be careful that the 
correspondence between Y and (r, U ) is not unique. In fact, even when Y has distinct eigenvalues, there are m! 2 m ways of 
choosing (r, U). The factor to! corresponds to all possible reorderings of ri,... ,r m , and the factor 2 m corresponds to the 
orientation of the columns of U, (multiplication by +1 or —1). Accordingly, for any function / : V m —> R, 

f f(Y)dv(Y) = (m\ 2Tx8^ / f /(Y(r, U)) det(g) TT sinh ( ,r< ~ ^ ) TT d n (15) 
Jv m Jo(m) Jr™ iKj V 1 / i=1 

where division by to! 2 m cancels out the ambiguity in choosing (r, U). This formula is given in |5j, (Proposition 3, Page 43). 


11. The Rao-Fisher metric: geometric properties 

The Rao-Fisher metric, when introduced on the space V m , leads to many important geometric properties. Among these 
properties, the ones which will be used in following sections are here recalled and briefly explained. 

First, consider the fact that this metric turns the space V m into a Riemannian homogeneous space under the action of the 
linear group GL(to). Recall GL(to) is the group of to x to real invertible matrices. This group acts on V m by congruence 
transformations, which are defined as follows 0, 

(Y, A) ^ Y ■ A Y ■ A = A*YA (16) 


for Y £ V m . and A £ GL(to). The notation Y-A is well suited, in view of the right action property: Y ■ (A1A2) = (Y ■ Ai)-A 2 , 
for Ai,A 2 £ GL(to). 

The term “homogeneous space” means that for any Y, Z £ V m there exists A £ GL(to) such that Y ■ A = Z. One possible 
choice of A is A = Y~ x l 2 Z 1 / 2 . Moreover, the term “Riemannian homogeneous space” means that the Rao-Fisher metric and 
Rao’s distance remain invariant under the action of GL(to) on V m ■ This is stated in Proposition [I] 

Proposition 1 (Riemannian Homogeneous space): For Y, Z £ V m and A £ GL (to), the following hold. 


ds 2 (Y) = ds 2 (Y ■ A) 

ds 2 (Y) = ds 2 (Y~ l ) 

(17a) 

where ds 2 (Y) is the Rao-Fisher metric 0. 

d(Y, Z) = d(Y ■ A, Z ■ A) 

d(Y,Z) = d(Y- 1 ,Z~ 1 ) 

(17b) 


where d : V m x V m —> R+ is Rao’s distance 0. 

Proof: Identities (17i are well-known in the literature 00- They state that congruence transformations, as well as matrix 
inversion, are isometries of the space V m , equipped with the Rao-Fisher metric. 


By general arguments from Riemannian geometry, (17b 1 is a direct result of (17a 1 . Here, as an illustration, is the proof of 
the first identity in ( | 1 7a[ >. Let W = Y ■ A, so dW = dY ■ A. Replace in 0, 

ds 2 (W) = tr [{A^YA^AUYA} 2 = tr [Y~ 1 dY] 2 


which is just ds 2 (Y). For the second identity, recall that if W = Y 1 then dW = — WdYW , see [36|. ■ 

The following Proposition [ 2 ] states that integrals with respect to the Riemannian volume element dv(Y) are invariant under 
the action of GL(to) on V m ■ 

Proposition 2 (Invariant integrals): For any function / : V m —> R, and any A £ GL(to), 


[ f(Y) dv(Y) = 

[ f(Y-A) dv(Y) 

(18a) 

>v m 

)v m 


/ f(Y)dv(Y) = 
JVm 

[ /(Y -1 ) dv(Y) 

JVm 

(18b) 


whenever these integrals exist. 

Proof: This proposition is a corollary of Proposition [I] Intuitively, it holds because dv(Y) is the Riemannian volume element 
of ds 2 (Y), and ds 2 (Y) is invariant under congruence transformations and inversion, according to Proposition [I] A simple 
proof may be found in |[5). ■ 

The third and last property of the Rao-Fisher metric, which will be needed in the following, refers to the existence and 
uniqueness of Riemannian centres of mass. 
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Let 7r be a probability distribution on V m - The variance function of n is 8^ : V m —> R+, 

£n(Y)= f d 2 (Y,Z)dn(Z) (19) 

Following |l^8 |, a point Y n £ V m , which is a global minimiser of £ v , is called a Riemannian centre of mass of n. 

The following Proposition [3] is a result of the property that the space V m ■ equipped with the Rao-Fisher metric, is a 
Riemannian manifold of negative curvature. 

Proposition 3 (Riemannian centre of mass): If tt is a probability distribution on V m , then tt has a unique Riemannian centre 
of mass Y -. Moreover, Y„ is the unique stationary point of the variance function £~. 

Proof : The proposition is a corollary of a general theorem in fl8) (Theorem 2.1., Page 659). This theorem states the proposition 
will hold, as soon as it can be shown that V m , equipped with the Rao-Fisher metric, is a Riemannian manifold of negative 
sectional curvature. However, this is a well known fact, which may be found in |4'|[j5j. Moreover, in 0 (Theorem 2.2.2., 
Page 428), an explicit formula is given, for the Riemannian curvature tensor of V m , and is used through a direct calculation, 
to find the sectional curvature of V m and check that it is negative. ■ 


III. Riemannian Gaussian distributions 

The main theoretical contribution of the present paper is to give an exact formulation of Riemannian Gaussian distributions. 
A Riemannian Gaussian distribution G(Y, a) is a probability distribution on V m , given by the probability density function, 
with respect to the Riemannian volume element (fi~0|). 


p{Y\Y,a) = 


C(cO 


exp 


d\Y,Y) 
2 a 2 


( 20 ) 


where Y £ V m and cr > 0 are parameters, and d( Y. Y) is Rao’s distance given by (|9ji. For brevity, the term Gaussian distribution 
will be used, instead of Riemannian Gaussian distribution. The present section is organised as follows. 

is concerned with the definition of Gaussian distributions. In this paragraph. Proposition [4] gives an exact 


Paragraph III-A 


expression of the normalising factor £(cr), and Propositions |5]and|6]define a general method for sampling from a given Gaussian 
distribution. 

Paragraph III-B studies statistical inference problems for Gaussian distributions. In this paragraph. Proposition [7] deals with 
maximum likelihood estimation of the parameters Y and er, while Proposition [8] deals with the problem of hypothesis testing, 
(precisely, te sting against a given value of Y). 

Paragraph III-C recovers the asymptotic properties of the maximum likelihood estimate of Y. These properties, of consistency 
and asymptotic normality, are given in Propositions [TO] and m They essentially rely on Proposition [9] which states that the 
parameter 7 of a distribution G(Y. o) is the Riemannian centre of mass of this distribution. 

The idea of considering probability densities on V m , which depend on Rao’s distance as in ( [20] ), is due to Pennec J20) . 
What has been lacking, in order to make this idea generally applicable, is a better understanding of the normalising factor 
((a). The desire to achieve such an understanding is the starting point of the present section. 


A. Definition and basic properties 

To define a Gaussian distribution G(Y, cr), by means of the probability density function ( |2()[ , it is necessary to have an 
exact expression of the normalising factor Q(rr ). This is given by the following Proposition [4] 

Item (i) of this proposition confirms an important property of ( )(cr). Namely, thi s nor malising factor does not depend on the 
parameter Y. On the other hand, item (ii) states the expression of £(er), formula ( |24b) >. 

Formula (24b i may look somewhat complicated. However, for values of m up to m = 50, it has been easily evaluated using 
Monte Carlo integration. In particular, this has made it possible to build tables of £(cr) as a function of cr. For m = 2, formula 
yields the analytic expression 

C(cr) = (2 tt) 3 / 2 a 2 x e CT /4 x erf(tr/2) (21) 


( 22 ) 


where erf denotes the error function 

In order to state Proposition [d] consider the following notation. For Y £ V rn and cr > 0, let f(Y\ Y,cr) be given by, 

d 2 (Y,Y r 


f(X I Y > cr) = e x P 


2 a 2 


Also, let C(y,cr) be the integral. 


C (Y,a)= f(Y\ Y,a) dv(Y) 


(23) 


where dv(Y) is the Riemannian volume element (10 1 . 
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Proposition 4 (Normalising factor): The following hold, 

(i) For any Y £ V m and cr > 0, 

C(F,a)=C(/,a) 

where I £ V m is the to x to identity matrix. 

(ii) Let ((a) = ((I, a). Then, 


((a) = (m! 2 m ) 1 x w m x 8 <4 


where uj m is given by 


0 -(r 1 2 + ...+r^)/2o- 2 


2 m 7r m2 /2 


sinh (|Tj - r-j 1/2) dr t 

i<j 


i=1 


(24a) 


(24b) 


(24c) 


r m (m/ 2 ) 

with T m the multivariate Gamma function, given in [38]]. 

Proof: Item (i) follows from Propositions [I] and [ 2 ] in II-B Let Y = I ■ A, in the notation of ( p~6] >. Replacing (17b 1 in ( 22 1, 
it follows that 

f(Y\Y,a) = f(Y-A~ 1 \I,a) 


Then, applying (18a 1 to (23 1, it follows that ((Y,a) is given by 

[ f(Y-A~ 1 \I,a)dv(Y)= [ f(Y\ I, a) dv(Y) 

JVm JVm 

which is just ((I, a). This proves (24a 1. A similar proof of this item may be found in [22), (Theorem 2.1., Page 597). 


Item (ii) follows by using (15 1 to compute the integral £((7) = ((I, a). Indeed, if Y = Y (r, U) as in (111, then it follows 

(25) 


from ([9}, see ED (Theorem 2.2.3., Page 430), 

f(Y\I,a) = e- {r ? + - +r ™ )/2 ' 7 

Since this does not depend on U, replacing in (fl5| gives 


C(ct) = x 8 — 


(m-1) 


o~( r l +---+ r m)/2° 


' O(m) 


det(0) J^sinh 


i<j 


= (to! 2 m ) _1 x 
Recall that, (see (38), Page 71), 


n(m-l) 


> O(m) 


det(0) 


= -(r 1 2 + ...+r 2 )/2a 2 


n s i n h 


i<j 


Ti - r,- 


n dr * 

m 

n dri 


/ 061(0) = 
J O(m) 


(26) 


(27) 


Now, ( |24b| > follows immediately. ■ 

The following Propositions [5] and [6] define a general method for sampling from a given Gaussian distribution, which will be 
described at the end of this section. To begin, Proposition [5] gives the transformation properties of Gaussian distributions. 

For the statement of this proposition, recall that if X is a random variable and p, a probability distribution, then A - ~ // 
means the distribution of X is equal to //. Recall also the notation Y ■ A for congruence transformations, given by ( fi~6] ). 
Proposition 5 (Transformation properties): Let Y be a random variable in V„,. For all A £ GL(to), 


Y ~ G(Y, a) => Y-A~ G(Y ■ A, a) 
Moreover, (recall I £ V m is the to x m identity matrix), 

Y ~ G(I, a) ==► F” 1 ~ G(/, a) 


(28a) 


(28b) 


Proof: This follows from Propositions [I] and [2] in II-B To show (28a). let p : V m -> R be a test function. If Y ~ G(Y,a) 


and Z = Y ■ A, then the expectation of p(Z) is given by 

/ <p(Y-A)p(Y\Y,a)dv(Y) = [ <p(Z)p{Z ■ A~ 1 \Y,<T)dv(Z) 


where the equality is a result of ( | 18 a| ) , and the variable of int egrati on was simply rena med from Y to Z. Now, by (17b 1, 
p(Z ■ A~ 1 \Y,cr) = p(Z\Y ■ A, a). This completes the proof of (28ai. The proof of (28b 1 follows a similar reasoning. ■ 





























Proposition 6 (Gaussian distribution via polar coordinates): Let U and r = (ri,... , r m ) be independent random variables, 
with values in O(m) and R m , respectively. Assume that U is uniformly distributed on 0(m) and that have the 

following joint probability density 

p(r) = (m!2 m )- 1 x w m x x ^( a ) e~^ + - +r ^/ 2 ^ JJainh (|r< — ]/2) (29) 

i<j 

If Y = Y(r , U ) as in © then Y ~ G(J, a). 

Proof: Assume Y = Y(r, U ) and let ip : V m —> R be a test function. It is enough to prove that the expectation of p(Y) is 
given by 

[ <p(Y)p(Y\I,a)dv(Y) (30) 

Jv m 

Recall, (see [381, Page 70), the uniform distribution on O(to) is given by x det(0). Since U and r are independent, the 
expectation of <p(Y) is given by 


’ O(m) 


<p{Y(r, U)) det(0)p(r) dri 


i—1 


Replacing the expression (291 of p(r), and rearranging, this is found equal to 


(™-i) 


(i m\2 m y 1 x 8—x C-» 


<p(Y(r,U))e ( r i +---+ r m)/ 2a det(0) sinh(|ri — rj|/2) dr* 

i<j 


JO(m, •/«- t<j i=1 

By 0 and ( [23] ), this is the same as ( |30] . The proof is thus complete. ■ 

Remark: Proposition (|6|» has the following implication. If Y ~ G(I. a) then the determinant of Y has log-normal 
distribution. Precisely, if t = logdet(F) then t is normally distributed with mean 0 and variance m x er 2 . This can be 
seen from ( [29] ), using the fact that t = r\ + ... + r m . While interesting in itself, it is not used in this paper. 

Propositions [3] and [6] yield the following general method for sampling from a Gaussian distribution. 

— Sampling from G(Y. a): 

► By Proposition [H] if Y ~ G(I,a) then Y ■ Y 1 / 2 ~ G(Y,a). Therefore, it is enough to consider sampling from G(I,cr). 

► By Proposition [6] to sample from G(I,a), it is enough to know how to sample from a uniform distribution on 0(?n), 
in order to generate U, and from the multivariate density ( [29] ), in order to generate r. Once these are obtained, they can be 
replaced in the spectral decomposition Y = Y(r. U ), to generate Y ~G(I, a). 

► Sampling from a uniform distribution on O (m) is fairly straightforward, (see [38], Page 70). Indeed, if A is an rn x rn 
matrix, whose entries are independent and each with normal distribution Af( 0,1), and if A = UT with U orthogonal and T 
upper triangular, then U is uniformly distributed on O(m). 

|. In the 


► Sampling from the density (291 can be achieved by usual application of the Metropolis-Hastings algorithm 
special case m = 2, the density (29 1 can be written as a product of two densities, for two independent variables. Indeed, in 
this case, let t = r 1 + r 2 and p = r 1 — r 2 . It follows from ( [29] ) that 

g -t /4t t xp(p) p(p) (x e~ p sinh(|p|/2) 


p{r) 

Therefore, when m = 2, sampling from the probability density 
densities of t and p. 


(31) 


only requires sampling from the univariate probability 


B. Statistical inference problems 

It is clear that Rao’s distance is the main ingredient in the definition of Gaussian distributions. Accordingly, it is to be 
expected that Rao’s distance play an essential role in statistical inference problems for Gaussian distributions. Here, this is 
considered for the problems of maximum likelihood estimation and of hypothesis testing. 

The following Proposition[7]deals with maximum likelihood estimation. For a Gaussian distribution G(Y. a), this proposition 
shows how maximum likelihood estimates of the parameters Y and a can be computed. 

The more interesting case is that of Y . It turns out that, based on independent samples Vj,..., K v from G(Y,o), the 
maximum likelihood estimate of Y is equal to the empirical Riemannian centre of mass of Y\..... L v . The concept of 
empirical Riemannian centre of mass is among the most widely used concepts in the applied literature | T5J | T9j . Proposition 
[7] shows that Gaussian distributions provide a statistical foundation for this concept. 

The empirical Riemannian centre of mass of Yi,..., Y N is the unique global minimiser Y N of £ N : Vm —f R_, 

£AY) = ^J2 d2 (Y,Y n ) 

n—1 


(32) 
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Comparing this to < 19 1 of II-B it can be seen Y w is the Riemannian centre of mass of the empirical distribution tt n = 


jj(Sy i + ... + <5y- jV ), where 5y denotes the Dirac measure concentrated at Y G V m ■ Thus, existence and uniqueness of Y N 
follow from Proposition [3] 

Proposition 7 (MLE and centre of mass): Let Y\ ..... Y v be independent samples from a Gaussian distribution G(Y, cr). 
Based on these samples, the maximum likelihood estimate of the parameter Y is the empirical Riemannian centre of mass Y N 
of Yi,.. ., Y n . Moreover, the maximum likelihood estimate of the parameter a is the solution d N of the equation. 


0-3 X 3 _1 °gC(o') = £n(Y n ) 
dcr 

where the unknown is cr. Both Y N and o N exist and are unique for any realisation of the samples Y \,..., Y N . 


(33) 


Proof: Using (201, and the fact that Y\,Y N are independent, the log-likelihood function, of the parameters Y and cr, can 
be written 


E lo S P ^\= -JVlogCW - A E d 2 (F, Y n ) 


2 cr 2 

n=i n=i 

Since the first term on the right-hand side does not contain Y, the maximum likelihood estimate of Y can be found by 
maximising the second term. This is equivalent to minimising the sum of squared distances, which also appears in ( |32) >. The 
corresponding global minimum is precisely Y N , whose existence and uniqueness follow from Proposition [ 3 ] (compare to [22|, 
Corollary 2.2., Page 598). To find the maximum likelihood estimate of cr, introduce the new parameter 77 = — 1/cr 2 and let 
^( 77 ) = logC(cr). Then, ip(rf) is a strictly convex function, by application of Holder’s inequality to (24b 1 . Therefore, the log- 
likelihood function, in the above expression, is a strictly concave function of 77. Its maximum is found by solving the equation 
7/7(77) = £ N (Y N ), which is equivalent to (33 1, (here, the prime denotes derivation). Existence and uniqueness of the solution 
d N of ( [33| ) now follow. Indeed, ^>'( 77 ) is a strictly increasing function, since ip(t]) is strictly convex. Also, ijf(rj) tends to 0 as 
77 —> — 00 and to +00 as 77 —> 0 . This implies the equation 7 / 7 ( 77 ) = c has a unique solution for any c > 0 . ■ 

Proposition [ 7 ] indicates how to compute numerically the maximum likelihood estimates Y N and d N . Indeed, this proposition 
states that Y v is the empirical Riemannian centre of mass of the samples Y|..... Y v . As such, the task of computing Y N 
is well-studied in recent literature, and can be carried out using algorithms based on deterministic line-search f2T)[40| , or 
on stochastic gradient descent |41||42], The deterministic Riemannian gradient descent algorithm for computing Y N is given 
below in formulae ( |65) a nd (661 of Paragraph |I V-A based on J2T| . With regard to d N , Proposition [ 7 ] states that it is the unique 
solution of equation (|33|>. This is a non-linear equation in one real-valued unknown, namely cr. As such, its solution d N can 
be computed bv standard application of Newton’s algorithm. 

Proposition M also shows that d N measures the mean dispersion of the samples Y\,... , Y N away from Yy. Indeed, the left- 


hand side of equation (33 1 depends only on the unknown cr, while its right-hand side has the fixed value tv -1 x 1 <7 2 (Y v , Y n ) , 
given by (32 1 . Since Proposition [7] states the solution d N of this equation exists and is unique for any value of its right-hand 
side, it follows that there exists a function : R+ -> R+, such that 


d N = $(N- 1 xT,» =1 d 2 {Y N ,Y n )) 


(34) 


In view of equation (33 1 , this means that <I> is the inverse function of cr 1 —> cr 3 x d log £(<r)/dcr. The argument of the function <T> 
in (34 1 gives the mean dispersion of the samples Yi,..., Y N away from Yy . On the other hand, from the proof of Proposition 
[7] it can be seen that $ is a strictly increasing function. Indeed, up to the change of parameter 77 1 — >• cr, this function $ is the 
same as the inverse function of -0'( 77 ), which is strictly increasing. Thus, (34 1 shows that d N is a direct measure of the mean 
dispersion tv -1 x J2„=i d 2 (Y N , Y n ), obtained by application of the strictly increasing function $. 

The following Propositionj8] deals with the problem of hypothesis testing. Precisely, it deals with the problem of testing 
against a given value Y 0 of Y, while cr remains unknown. This problem is described by the null and alternative hypotheses, 
(see (43) , for general background), 

H 0 :Y = Y 0 : Y f Y 0 (35) 

Proposition [8] expresses the log-likelihood ratio statistic T, used for testing If against //,,. 

Proposition 8 (Likelihood ratio test): The log-likelihood ratio statistic, corresponding to hypotheses is the following 

'C($(Jv- x xE Z =1 d 2 (Y N ,Y n )))\ , £~ =1 d 2 (Y N , Y„) d 2 (Y 0 , Y n ) 


T = log 


((<!>( tv -1 x En=i ^ 2 ( Y 0 , Y„))) I 2<f>( tv - 1 x e”=x d 2 { Y N , Y n )) x d 2 ( Y 0 , Y„)) 


(36) 


The likelihood ratio test rejects H 0 when T > k and accepts it otherwise, where k is chosen to fix the level of significance. 
Proof: By definition [43), the log-likelihood ratio statistic is 

sup n" = iP(Y n \Y,<r) 

Ho 


T = log s sup n^piXnlY,^ 
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where the supremum in the numerator is over all possible values of (Y. a), and the supremum in the denominator is over only 
those values which belong to hypothesis H 0 . Expression (36] follows by noting that the supremum in the numerator is realised 
for (Y,a) = ( Y N ,a N ), by definition of maximum likelihood estimates, and the supremum in the denominator is realised for 

(Y,a) = ( Y o, £o). where 

<r 0 = <f>( JV" 1 x £™ =1 d 2 (Y 0 , Y n )) 


The remaining part of the proposition is the usual definition of the likelihood ratio test, as in (43]. ■ 

Admittedly, expression (36] for the log-likelihood ratio T is quite involved. This is due to the fact that hypotheses H 0 and 
H 1 of (35] are composite hypotheses, depending on the unknown parameter a. If it were known that a = a 0 , then T would 
reduce to, (as in the proof of Proposition [8] this is by definition of the log-likelihood ratio), 

T = ^ { J2n=i d 2 (Y N ,Y n ) - £« =1 d 2 (Y 0 ,Y n )} (37) 


so the likelihood ratio test amounts to comparing the dispersion of the samples V),..., Y v away from Y N , to their dispersion 
away from Y 0 . In all generality, it can be shown on the basis of Proposition 11 of Paragraph III-C| that the asymptotic 
distribution of T under hypothesis H 0 is a chi-squared distribution (this is in the limit N —> oo). Using this asymptotic 
distribution, it becomes straightforward to choose the value of k in order to obtain a preassigned level of significance (i.e. false 
alarm probability, or alpha-level). These statements yield a direct generalisation of Wilks’ theorem (43) (Page 132). 


C. Riemannian centre of mass and asymptotic properties 

The present paragraph recovers the as ym ptotic properties of the maximum likelihood estimate Y N of the parameter Y, in the 
limit N —» oo . Recall, from Proposition |7j the estimate Y N is the empirical Riemannian centre of mass of samples Y\..... Y N 
from the distribution G(Y",<r). 


These asymptotic properties of consistency and of asymptotic normality of Y N will be given in Propositions 10 and HU 


respectively. Proposition 10 states that Y N converges almost surely to Y, while Proposition HU states that Y N , in a sense made 


precise by the proposition, is asymptotically normally distributed about Y. 

It is outside the scope of the present paper to establish the further property of asymptotic efficiency of Y N . This property 
can be given a precise formulation on the basis of the work of Smith |44], (see Corollary 3, Page 1619), and is the focus of 
ongoing work. 

rely on the following Propos ition [h] which states that Y is the Riemannian centre of mass of the 


Propositions 10 


and 


11 


distribution G(Y,a), (in the sense defined after ( | 1 9] in |II-B| ). 

In the statement of Proposition [9] the following notation is used. Recall definition (T9] of the variance function £~ of a 
probability distribution n on V rn . If ir is the distribution G(Y, a), then £ n (Y) will be denoted £(Y \ Y, a). 

Proposition 9 (Centre of mass of a Gaussian distribution): For any Y £ V m and a > 0, the following properties hold for 
the Gaussian distribution G(Y,a). 

(i) Y is the Riemannian centre of mass of G(Y,a). That is. 


Y = argmin r £(Y \ Y, a) 


(38a) 


(ii) er is given by 

cr = $^J p d 2 (Y,Z)p(Z\Y,a)dv{Z)^ (38b) 

where the function *I> was introduced in (34]. 


Proof: The proof of this proposition is given in Appendix [A] A more general, information-theoretic, proof of item (i) was 
given by Pennec | 20) , (Theorem 3, Page 142). ■ 

Proposition 10 states that the maximum likelihood estimate Y N converges almost surely to the true value of the parameter 
Y, when N —t oo. 

This property of Y N is called consistency. In asymptotic statistics, consistency of maximum likelihood estimates is a well- 
known general result H). However, any statement of this result requires several additional technical conditions. Here, no such 
conditions will be necessary. Proposition [TO] follows immediately from Proposition [9] using a theorem on the convergence of 
empirical Riemannian centres of mass ]23|7^ 

Proposition 10 (Consistency of Y N ): Let V',, K>,... be independent samples from a Gaussian distribution G(Y, a). The 
empirical Riemannian centre of mass Y N of V,..... K v converges almost surely to Y, as N — > oo . 


Proof: According to (23] | (Theorem 2.3., Page 8), if Y L ,Y 2 , ... are independent samples from a probability distribution n on 
Vm , and Y n is the empirical Riemannian centre of mass of V',, Y N , then Y N converges almost surely to the Riemannian 
centre of mass of n, as N —t oo. 
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Applying this theorem to tt = GiY , a), it follows that Y N converges almost surely to the Riemannian centre of mass of 
G(Y,a). However, by Proposition hll this Riemannian centre of mass is exactly V. The proposition is thus proved. ■ 

Proposition |TT] is concerned with the asymptotic distribution of, roughly speaking, the “difference” between Y N and Y. 
When the space V m is considered as a Riemannian manifold, equipped with the Rao-Fisher metric, the natural definition of 
“difference” between Y N and Y is using the Riemannian logarithm mapping, (already mentioned in the proof of Proposition 
[9] formula (|78[i). This will be denoted 

A = Log V (Y v ) (39) 

The Riemannian logarithm mapping is defined as follows. Recall that any Y, Z £ V m are connected by a unique geodesic 
curve 7 (i), given by (| 8 |. The Riemannian logarithm Log Y (Z) is an rn / rn symmetric matrix, equal to the initial derivative 

M (0)- From 

Log Y (Z) = Y 1/2 log (Y- 1 / 2 YY“ 1/2 ) Y 1/2 (40) 


where log is the symmetric matrix logarithm. 

In order to study the asymptotic distribution of the matrix A of (39 1 , (that is, the asymptotic joint distribution of matrix 
entries), it is desirable to have a vector representation, identifying A with an array (A a ;a = 1 ..... p) where each A Q is 
real-valued. 

To obtain this, consider for each Y £ V r „ the scalar product, defined for any symmetric rn x m matrices v and w, 


(v, w) Y = tr [Y x vY 1 w] 


(41) 


This scalar product is equivalent to the Rao-Fisher metric (in Riemannian geometry, this is taken to be the definition of 
the metric @@)- 
Let A a be given by 

A a = (A ,e a )f a=l,...,p (42) 


where (e a ; a = 1.....p) is an orthonormal basis for the scalar product (41 1 , taken at Y = Y. 

Then, (A a ; a = 1,... ,p) are the components of A in the basis (e a ; a = 1,... ,p), so the asymptotic distribution of A is 
completely determined by the asymptotic distribution of the vector representation (A a ; a = 1,... ,p). 

Proposition[TT]states that the asymptotic joint distribution of the scaled components N 1/2 x A Q is a centred normal distribution 
with covariance matrix equal to 4<r 4 x C -1 , where C is the p x p symmetric positive definite matrix 

Cab = 4 x J ^ [A a (Z) x A b (Z)\ p(Z\Y,a)dv(Z) a,b=l,...,p (43) 

with, as in ( |39| and ( |42| ), 

A a (Z) = (Log V (Z), e a )v for Z £V m 

In the statement of Proposition [IT] £{•} denotes the probability distribution of the quantity in curly brackets, and =7 denotes 
convergence of probability distributions. 

Proposition 11 (Asymptotic normality ofY N ): Let (A a ;a = 1 ,p) be given by (42 1 . In the limit N —> 00 , 

C{N 1/2 x (A l5 ..., A p )} => AA(0,4a 4 x C~ x ) (44) 

where A/"(0,4 a 4 x C _1 ) is the centred normal distribution on R p with covariance matrix 4cr 4 x C -1 , with C given by |43| >. 

Proof: The proof follows from two lemmas, stated below. Lemma[I]establishes asymptotic normality of N 1/2 x (Ai, • ■ ■, A p ), 
and Lemma [2] shows the asymptotic covariance matrix is equal to 4tr 4 x C -1 . 

— Lemma 1 (Asymptotic normality): In the limit N — ► 00, 

£{N X ' 2 x (A,,..., A p )} =► AA(0, A) (45) 

where A = 11 ~ 1 C H~ 1 and H is the p x p symmetric positive definite matrix, 

H ab = V 2 £(Y)(e a ,e b ) a,b=l,...,p (46) 

Here, £ is the variance function Y 1 —> £(Y\ Y,<r), defined before Proposition [ 9 ] and V 2 £(Y) is the Riemannian Hessian of 
this function, evaluated at Y. 


— Lemma 2 (Asymptotic covariance): The asymptotic covariance matrix A, appearing in (45 1 , verifies A = 4er 4 x C 1 . 
The proofs of these two lemmas are given in Appendix |B| A result similar to Lemma[T]can be found in ]24) , (Remark 2.2., 
Page 1232), while Lemma [2] is motivated by a result in |44| , (Theorem 1, Page 1614). 

If Lemmas [Tland [2] are admitted, the proposition follows immediately. Indeed, replacing A = 4tr 4 x C -1 , as stated in 
Lemma [2j intop5]>, produces (44». ■ 
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IV. Mixtures of Riemannian Gaussian distributions 

Mixtures of parameterised probability distributions on a Euclidean space are widely studied in the statistical literature |26)|27). 
The present section generalises these mixtures, from Euclidean space to the space V m , considered as a Riemannian manifold. 
It does so by introducing mixtures of Riemannian Gaussian distributions on V m . 

For shortness, mixtures of Riemannian Gaussian distributions will be called mixtures of Gaussian distributions. A mixture 
of Gaussian distributions is a probability distribution on V m with probability density function 


p(x) = ^2 x Y t*’ ( 47 ) 

M= 1 

Precisely, this is a probability density with respect to the Riemannian volume element of V m , given by ( | 1 ()[>. H ere, w M 

are non-zero positive weights which satisfy m i + .. .+m M = 1, and each density p(Y\ Y^,<7^) is given by (20 1 , with parameters 
Y/j, £ V m and er M > 0. The number M will be called the number of mixture components. 

The remainder of this section is devoted to the estimation of mixtures of Gaussian distributions, and to their use in the 
classification of data in V m - The results which are developed are partially based on a recent work by the authors |45| . They 
are aimed to generalise analogous results, for mixtures of parameterised distributions on a Euclidean space. 

Paragraph |IV-A derives a new EM (expectation-maximisation) algorithm, for computing maximum likelihood estimates of 
mixture parameters i? = {(-zu^, Y fI . cr M ); p = 1,..., M}. 


Paragraph IV-B proposes a new Bayes classification rule, based on mixtures of Gaussian distributions, for the classification 
of data in V m . 

The present section is motivated by the idea that the class of mixtures of Gaussian distributions is expected to be sufficiently 
rich, in order to represent the statistical distribution of data in V m which arise in real-world applications. Roughly, this 
is because one expects that any probability density on V m can be approximated to any required precision by a mixture 
of Gaussian distributions, provided the number M of mixture components is suitably large. As stated in the introduction, 
experimental verification of this idea is still ongoing 


A. A new EM algorithm for mixture estimation 

Let Yi,...,Y n be independent samples, drawn from a mixture of Gaussian distributions, given by ©■ This paragraph 
considers the task of computing maximum likelihood estimates of mixture parameters d = 5L, <7 M );p = 1 

based on the samples Y \,..., Y N . 

This task will be realised using a new EM (expectation-maximisation) algorithm, which generalises, to the context of the 
Riemannian geometry of the space V m , the currently existing EM algorithms used in the estimation of mixtures of parameterised 
distributions on a Euclidean space ]26) . 

It is assumed, throughout the following, that the number of mixture components, denoted M in ( |47j ), is known and fixed. 
The problem of determining a suitable M, in view of the samples Y \..... Y N , is known as order selection. This problem is 
not considered here, as it falls outside the scope of the methods used in the present paper. In principle, it could be tackled by 
application of existing general methods, such as those based on information criteria J46) . 

In the remainder of this paragraph, the new EM algorithm will be derived, based on the general principle set forth in the 


founding paper [47 . This algorithm iteratively updates an approximation i) = { {zu lu Y p , < 


,)} of the maximum likelihood 
estimates of mixture parameters if by repeated application of so-called E (expectation) and M (maximisation) steps. In order 
to specify these two steps, the following setting is needed. 

To begin, assume the mixture model ( |47| was generated based on latent variables Li,..., L N , defined as follows, (compare 
to |26) , Page 84). Let ..., L N be independent identically distributed random variables, which take the values 1,..., M, 
with respective probabilities, (here, P denotes an underlying probability measure), 


’ (L n = p) = tz7 p 


(48) 


In order to obtain samples Y \,..., Y N , from the mixture distribution (47 1 , assume the couples ( L n . Y n ) are independent, and 

P(Y n =Y\L n =p)=p(Y\Y fl ,a^) (49) 

Intuitively, L n is the membership label of Y n , so L n = p indicates that Y n belongs to component // of the mixture. 

The distribution of Li,..., Y N can be recovered from (|48]> and (|49[i. Indeed, 


’ (Yn = Y) = F (L n = p) x P (Y n = Y\ L n = p) 


M =1 


= ^ x p(Y\ Y^, aft) 

n=l 
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which is exactly ©• 

Let C denote the number of L n which are equal to p . That is, 

N 

C n = '51 1 {Ln=i*} (50) 

n—1 

where t A denotes the indicator function of event A, (equal to 1 if A holds and to zero otherwise). Given the interpretation of 
the L n as membership labels, C ; , is called the membership count of component // of the mixture. 

Consider now the quantities w M (Y ra ) and N fl , defined as follows, (see discussion in [|47j. Section 4.3.), 

= E (l { i, n=#l} | Yi,..., Y n ) 

= E (L n = Y n ) (51) 

N li =E[C li \Y 1 ,...,Y„] (52) 


where E denotes expectation. Given these definitions, one calls u>^(Y n ) the conditional membership probability, and Aj, the 
conditional membership count, of component /i of the mixture. These are given by the following formulae. 


N 

^n{Y n ) OC X p(Y n \ CT m ) N, = Y: 

n—1 


(53) 


which follows from (51 1 and (52 1 , after a simple application of Bayes’ theorem. Here, the constant of proportionality, 
corresponding to “oc”, is chosen so that u>i(Y n ) + ... + u> M (Y n ) = 1. Moreover, it is clear that 


M=1 


= N 


(54) 


The E and M steps can now be specified, using definitions (51 1 and (52!. To emphasise that these two definitions are applied 
under a given value of •& = {(tY M , cr^)}, the following notation is used. 


= W/1 (Y n , i?) N„ = N^) (55) 

The general form of the E and M steps is the following (47) , 

► E step: based on the current value of r d, compute 

Q(0|t?) = [Y^J{L n ,Y n ) \Y u ...,Y n ] (56) 

where denotes expectation under the value x) of the mixture parameters x), and where £(L n ,Y n ) is the joint log-likelihood 
of L n and Y n . In other words, the E step computes the conditional expectation of the complete log-likelihood, based on the 
current value of x). 

► M step: assign to d the new value 

t? new = argmax^ Q(t?|t?) (57) 

In other words, the M step updates d by maximising the conditional expectation of the complete log-likelihood, as provided 
by the E step. 

In order to carry out the E step, it is enough to note 


t{L n ,Y n ) = E“=i l{L„=^}{log + logpiY^Y^a^)} 


(58) 


as follows from (48 1 and (49!. After replacing expression (20 1 of p(Y n Y„, a t ,) and performing some algebraic transformations, 
the conditional expectation in (561 is found to be, (see again [47) , Section 4.3.), 


= E"=i |A r M (i?){logti7 A1 -logC(cr M )} - n ,fi) d 2 (Y^Y n )/2al} 


(59) 


In order to carry out the M step, it is required to maximise this expression with respect to •& = {Y^. ct ; ,)}. The 
maximisation can be carried out, first over the values of , then over those of Y jL , and finally over those of cr M . 

Note that enters (59 1 only via the expression 


M 


log wf, 

1 


(60) 


By Jensen’s inequality, applied to the concave function log, this is maximised, (under the constraint w\ + ... + w M = 1), 

w*™=N^)!n (61) 


when vo M where 
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Maximising (59i over the values of Y^ reduces to minimising, separately, each of the expressions, 


N 


£^Y) = Y / ^(Y n ,V) d \Y, Y n ) 


(62) 


over the values of Y £ V„ 


. The corresponding minima will be denoted V'" ew . 

Finally, maximising (59 1 over the values of rr /( can be carried out by differentiating with respect to a /( . 


derivative equal to zero. A direct calculation shows that this yields the solution 

K ew = *( V W x u„(Y nt 1 9)d 2 (%, Y n )) 


and setting the 


(63) 


where the function >1) was defined in ([34]), Paragraph III-B 


From the above, it is seen that the new EM algorithm, from a practical point of view, consists in repeated application of 
the update rules ([(4), (62 i and ( [63] ). in this same order. These update rules should be repeated for as long as they introduce 


a sensible change in the values of , T ;i , and <j m . Other stopping criteria, involving the amount of increase in the joint 
likelihood function of , V ; ,, and cx M can also be used. 

Realisation of the update rules for and tf M is rather straightforward. On the other hand, the update rule for Y fl requires 
minimisation of the function £,, : V m —> R+, defined by (62 1 . This is a function of the form given by (19 1 , in Paragraph II-B 


so Proposition 3 g uarantees that it has a unique global minimiser Y7 lew , which is also a unique stationary point. 

Comparing J62b to the general expression (19 1 , it appears clearly that Y / j lew is the Riemannian centre of mass of the 
probability distrTBution on V m , given by 


UJn 


= ywr n ) x 6 y „ 


n—1 


(64) 


Here, recall that 5y denotes the Dirac measure concentrated at Y £ V m . As mentioned after Proposition [7] there exist several 
algorithms for computing Riemannian centres of mass [2T]|[40)-||42). One of the most familiar among them is the Riemannian 
gradient descent algorithm, which may be found in 12 . The A:th iteration of this algorithm produces an approximation of 
l^ lew in the following way. For k = 1,2,, let A fc be the symmetric matrix 


= Y 

n—l 


(Y n ) x Log^fc-i (Y n ) 


where Log denotes the Riemannian logarithm mapping, given by (40 1 . Then, V', 

Y* = Exp 


k is defined to be 


v. 


(n A k ) 

where Exp is the Riemannian exponential mapping, inverse to the Riemannian logarithm mapping, given by 

Exp y (A) = Y 1/2 exp (r 1 / 2 Ar 1/2 ) Y 1/2 

with exp the matrix exponential, and where r k > 0 is a step size, to be determined using a backtracking procedure. 


The Riemannian gradient descent algorithm, specified by (65 i and ( 661 , is repeated as long as ||A fe || > e, where ||A 


(65) 


( 66 ) 


(67) 


is 


given by ([6]) and e is a precision parameter, to be chosen by the user. This algorithm is guaranteed to converge, when a suitable 
backtracking procedure is used, regardless of the initialisation Y°. A discussion of this convergence, in the case where Armijo 


backtracking procedure is used, can be found in |48| (Theorem 4.3.1., Page 65). 

The Riemannian exponential mapping ( |67| ), which is difficult to compute, can be replaced by certain so-called retraction 
mappings, which are less computationally demanding, without any change in the convergence of the algorithm. Some of these 
retraction mappings are given in [49), while a general characterisation of the class of suitable retraction mappings can be found 


in 50] (see, in particular. Paragraph 3.4., Page 11). 


B. Classification using mixtures of Gaussian distributions 

Mixture distributions lead to popular tools for classification of data which lie in a Euclidean space 1511. Moving beyond 
this usual Euclidean setting, the present paragraph considers the use of mixtures of Gaussian distributions in the classification 
of data in Vm ■ 

Classification of data which lie in V m is an important challenge for several applications, including remote sensing 114)|29), 
computer vision [15][52|, and medical imaging ]8)]53). Most of these applications have used already existing classification 
techniques, such as Bayes classification (see (51) , Section 2.4.), used in 1 29) 152][531, or regression techniques (see [51) , 
Section 2.7.), used in GDGD- On the other hand, some applications have evolved new classification techniques based on the 
Riemannian geometry of V m . This is the case of whose approach, (discussed in the introduction of the present paper), is 
based on using Rao’s Riemannian distance. 
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In the present paragraph, the focus will be on supervised classification. A new classification rule is introduced which, like 
those of j29)|52)|53), implements the principle of Bayes classification, but which also uses Rao’s Riemannian distance, as 
in |8|. Precisely, this new classification rule is a Bayes optimal classification rule, using posterior membership probabilities, 
computed by the EM algorithm, described in the previous paragraph. 

The following setting is considered, (similar to 0|||29![52)|53)). Assume known a training sequence T. Precisely, T C V m 
is a set of data points, produced in some real-world application. Assume also known a partition of T into disjoint classes. Each 
class C C T results from a well-defined, distinct experiment, (for example, application of a measurement device to a given 
object). The data points which make up each class C, while arising from the same experiment, may still display heterogeneous 
properties, (for example, if they are produced from measurements taken in different conditions). Therefore, each class C may 
further break down into clusters C\,..., C M , where M depends on the given class C. Accordingly, the training sequence T 
can be partitioned directly into disjoint clusters, say C\,..., C K , where I\ is the total number of clusters within the training 
sequence. Then, the data points which belong to each cluster display essentially homogeneous properties. 

In this setting, one has to carry out the two following tasks. First, for each class C C T, to identify the clusters C \,..., C M 
within this class. Second, whenever a new data point is produced, to associate this data point to a suitable cluster among 

c u ...,c K . 

For the first task, it is required to perform a clustering analysis of the data points in each class C. In |28j|52], these data 
points are modeled as a realisation of a mixture of Wishart distributions. Then, clustering analysis is performed using an EM 
algorithm for estimation of mixtures of Wishart distributions, or some variant of such an algorithm. 

Here, in view of introducing a classification rule which uses Rao’s distance, Gaussian distributions are chosen over Wishart 
distributions. That is, the data points in each class C are modeled as a realisation of a mixture of Gaussian distributions, given 
by (|47j), rather than a mixture of Wishart distributions, as in |28)|52|. Then, clustering analysis is performed using the EM 


algorithm of Paragrah IV-A 


Precisely, this algorithm computes M triples of maximum likelihood estimates, $ = Y^, a^); p = 1 ,,M} . These 


are used to identify clusters { C fl : p = 1 within the class C, in the following way. If C = { Y n 


= 1 , 




then each data point Y n is associated to the cluster C„* which realises the maximum conditional membership probability 


w li *(Y n ,t?) = max M (recall the notation of (p3b and (55 i). 


With the first task being realised as just described, the training sequence T is partitioned into disjoint clusters, say { C K ; k = 
1,..., K}. Moreover, through the mixture model ( |47| >, each cluster C K is identified with a Gaussian distribution C(Y(n). <j(k )), 
and may be represented by a triple of maximum likelihood estimates, (w(n), Y(n), cr(ft)). 

For the second task, assume a new data point Y t , called a test data point, has become available. The optimal way, (in the 
sense explained in f5l), Section 4.3.), of associating this data point to a cluster C K , consists in applying the following Bayes 
classification rule: associate Y t to the cluster C K - which realises the maximum 


N(k*) x p(Y t | C K <.) = max K N(k) x p(Y t \C K ) 


( 68 ) 


where N(k) is the number of data points in C K , and p(Y t \ C K ) denotes the density of Y t , assuming it belongs to C K . This is 
called a Bayes classification rule because the product N(k) x p(Y t C K ) implements Bayes formula, with C K assigned a prior 
probability P(k) oc N(k). 

The new classification rule, introduced in the present paragraph, evaluates ( |68l> u sing the representation of cluster C K by 
maximum likelihood estimates (w(k), Y(k), <t(ac)). Recall these are given by (611, (62 1 and (63 i. From (611, zb(n) is a 


consistent estimate of P(/«). On the other hand, in the present context, the density p(Y t \ C K ) is p[Y t \ F(k), ct(k)), which is 
consistently estimated using p(Y t \ <f(ft)). Accordingly, the new classification rule is the following: associate Y t to the 
cluster C K * which realises the maximum max K w{k) x p(Y t | Y(k), o'(k)). 

Recalling expression (20( for p(Y t \ Y(k), ain)), the proposed classification rule takes on its final form. Precisely, C K - is 


the cluster which realises the minimum 


min < - log w(k) + log ((o-(k)) + 


d 2 (Y t ,Y( K )) 
2a 2 (k) 


(69) 


as can be found from ( [20]) an d ( |68j ) after taking logarithms and changing sign. 

The classification rule (|69| can be interpreted as follows. This rule prefers clusters C K having a larger number of data points, 
(the minimum contains — log vj(k)), or a smaller dispersion away from their Riemannian centre of mass, (the minimum contains 
log C(<j(k))). When choosing between two clusters with the same number of points and the same dispersion, this rule prefers 
the one whose Riemannian centre of mass is closer to Y t . 

If the role of respective sizes and dispersions of clusters is neglected, (recall the size of a cluster refers to the number of its 
data points), then d69| reduces to a nearest neighbour rule, which chooses C K - in order to realise the minimum 


min K {d(Y t , Y («))} 


(70) 


This nearest neighbour rule is the main subject of (0. 
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To close this paragraph, note that the general Bayes classification rule ( 681 was also applied in [29], but with p(Y f \ C K ) 
given by the density of a Wishart distribution. Assume this Wishart distribution has expectation E(/c) € V m and number of 


degrees of freedom n(re), (for the definition of these parameters, see [38]). Then, by the same reasoning leading from ( 681 to 
one obtains the Wishart classifier, which requires C K * to realise the minimum 


|log w(n) — n(n) (logdet (E 1 (K,)Y t ) - tr(S 


(71) 


where zu(k), E(k) and fi(ft) denote maximum likelihood estimates, which can be computed as in 128 52 


The following Section compares the performance of classification rules (69», {TO} and {7T}. through a numerical experiment 
carried out on real data. 


V. Numerical experiment 

Among the many applications which require an effective approach to the classification of data in V ,„, is the problem of 
texture classification, in computer vision. 

The present section applies the approach developed in Paragraph |IV-B| to this problem. It shows that, in the context of this 
concrete application, the new classification rule ( [69] ) offers significantly better performance than classification rules (|70|) and 
proposed in [8] and [29), respectively. 

This section is organised as follows. First, a brief introduction to the problem of texture classification is given. Second, a 
numerical experiment is described, where classification mles (691, (70 1 and ( |7 1 [ ) are applied to this problem. Finally, the results 
of this numerical experiment are summarised and commented, based on Table [I] below. 

In computer vision [54] , a texture is a pattern of local variation in image intensity, observed on a fixed scale. For example, 
text printed on white paper, as on the current page, constitutes a texture. In a remote sensing image, regions belonging to 
agricultural land, forest, and urban area give rise to different textures. 

The problem of texture classification involves textures belonging to one of a set of predefined classes. In remote sensing, 
these could be agricultural land, forest, urban area, etc. The problem is to assign any new image, (more often, subregion of 
an image), to a suitable class. This problem is fundamental to medical image analysis, remote sensing, and material science, 
among other fields. 

The relationship between the problem of texture classification, on the one hand, and the problem of classification of data in 
V m , on the other hand, can be described as follows. 

Many popular mathematical representations of texture are based on statistical modeling of wavelet coefficients J55)—f58|. 
These representations have often been found useful, and are also justified by physiological and psychological studies of human 
visual perception 59) 60). 

In recent work |57||58), the statistical modeling of wavelet coefficients was carried out using multivariate probability 
distributions, which depend on a parameter Y £ V m ■ Such distributions include, for example, multivariate generalised Gaussian 
distributions, used in fZD’ which are parameterised by two positive scalar parameters, called the scale and shape parameters, 
and by Y £ V,„ , called the scatter matrix. 

Accepting the mathematical representation that “texture ss probability distribution of wavelet coefficients’'’ it is clear that, 
based on models such as those of |57||58], a texture can be described by a parameter Y £ V m , (of course, one may wish 


to include other parameters). Then, the problem of texture classification becomes identical to the problem of classification of 
data in V m , described in |IV-B| 

In order to apply the classification rules considered in |IV-B| to the problem of texture classification, a numerical experiment 
was carried out, using the Vision Texture image database (VisTex) ]30[ . The Vision Texture database contains images, (size 
512 x 512 pixels), of different materials or objects, such as bark, metal, brick, buildings, clouds, etc. 

Starting with 40 images, from this database, each image was subdivided into patches, (size 128 x 128 pixels, with 32 
pixel overlap). These patches are known to have the appropriate scale, at which textures can be observed. Therefore, it is 
possible to consider that each patch constitutes an individual texture, (This setup is usual in experiments using the VisTex 
database [55][58)). 

The subdivision of 40 images into patches produces 169 patches per image. Out of these patches, 84 were used for training, 
and the remaining 85 for classification. Training refers to computation of the maximum likelihood estimates appearing in the 
classification rules (69), and m Once this is realised, classification is carried out by applying the rules <[69]), {70]) and 
[7l] to the “classification patches,” in the aim of assigning each patch to a suitable class, in terms of the material or object 
which it represents. 

Table [I] shows the performance of each one of these three classification rules, in terms of so-called overall accuracy. Overall 
accuracy is the percentage of classification patches, (out of a total number of 85 x 40), which are correctly classified. The 
results given in Table |T] are averaged over 100 realisations of the experiment. For each realisation, the choice of training and 
classification patches, within the 169 patches in each image, is generated at random. The mean and standard deviation of 
overall accuracy for each classification rule are given in the format mean ± standard deviation. 
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TABLE I 

Supervised classification in the VisTex database 


Classification rale Overall accuracy 


rale 

rale 

rale 

rale 

rale 

rale 


69 

59 

70 
70 

7T 

7T 


with M = 3 
with M = 1 

with M = 3 
with M = 1 

with M = 3 
with M — 1 


94.3 ± 0.4 % 
86.2 ±0.4% 
92.1 ± 0.5 % 

82.5 ± 0.5 % 
89.7 ± 0.8 % 

84.6 ± 0.5 % 


In order to understand Table [I] consider the application of the classification rules ( |69] , {70] and CD- in the context of the 
present experiment, (the following description is simplified in order to maintain a reasonable length, but a fully detailed version 
may be found in {6l|). 

The experiment reproduces the setting of supervised classification, described in |IV-B Indeed, the starting point is a training 
sequence T, made up of 84 x 40 textures. This training sequence is partitioned into disjoint classes, where each class C consists 
of textures which belong to the same image, (therefore, these textures represent the same material or object). Since there are 
40 images, T is partitioned into 40 classes. 

The textures belonging to each class C display a diversity of lighting conditions and perspectives. To model this “in-class 
diversity,” it is assumed that C may be made up of disjoint clusters C±,, C M , (eventually, M = 1 is considered), where 
textures belonging to the same cluster display similar lighting conditions and perspectives. 

The training sequence was used to compute the maximum likelihood estimates appearing in the classification rules (69 1 , 


(70 1 and (71 1 . Then, these three classification rules were applied to each one of the 85 x 40 classification patches. Simply put, 
the aim is to retrieve, for each classification patch, the original image from which it was obtained, (and thus the material or 
object which it represents). 

Assume, for simplicity, each texture in the training sequence T is described by a parameter Y £ V m , (here, m = 2 is 
used). It follows that each class C, within this training sequence, can be identified with a set of data points in V m , say 
C = {Y u ...,Y n }. 

For classification rules ( [69] ) and (|70]), Y),... ,Y N are modeled as a realisation of a mixture of Gaussian distributions, given 
by For classification rule ( f7T] l, they are modeled as a realisation of a mixture of Wishart distributions, as in J28| (52]. In 
either case, the number of mixture components is M, the number of clusters contained in C. 

The maximum likelihood estimates appearing in rules ( |69] and ( f70[ were computed using the EM algorithm of |IV-A[ 
and those appearing in rule ( f7T| ) were computed using the EM algorithm of [28|[52j, for estimation of mixtures of Wishart 
distributions. These algorithms were applied to the points Y \,..., Y N , of each class C, using two choices of the number of 

mixture components M. These are M = 1, (which amounts to ignoring in-class diversity), and M = 3. It was assumed, 

(clearly, this is an artificial assumption), that M is the same for all classes contained in the training sequence. 

Table |T] gives the respective performance of classification rules ([69]), ([70] and CD- This was obtained by applying these 
rules to each one of the 85 x 40 classification patches. Recall that patches, since they have a suitable scale, can be identified 
with textures. Therefore, each classification patch was described by a parameter Y t £ V m , which was then used to evaluate 
the rules (69], {70] and ( |7Tj ). 

Table [I shows classification performance in terms of overall-accuracy. As already mentioned, this is the percentage of 
classification patches which are correctly associated to their original class, (that is, to the image from which they were 
obtained). Note that rules ([69], {70] and {71] provide additional information, as they associate each classification patch to a 
specific cluster, a subset of a class. This has no effect on the results shown in the table. 

A quick look at Table [I] shows the following : 

— The new classification rule ( [69] , proposed in IV-B[ provides significantly better performance than the nearest neighbour 

rule and the Wishart classifier rule ( fTT] , which were proposed by |8j and (29j, respectively. 

— The nearest neighbour rule {70] provides better performance than the Wishart classifier rule (71 1 , when the number of 

mixture components is M = 3. However, this is the other way around, when M = 1. 

— For each of the three classification rules, performance is improved when training is carried out using M = 3, instead of 
the trivial choice M = 1, (which reduces the mixture to a single component). 

Validation of the new classification rule ( [69] , through application to other databases, besides the VisTex database, and also 
to images acquired from remote sensing projects, is currently ongoing. This will include comparison of this new classification 
rule to a more comprehensive choice of specialised classification techniques, as may be found in the texture classification 
literature. 
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Appendix A 

Proof of Proposition^ 

This appendix provides the proof of Proposition [9] Paragraph |III-C| A different proof of the first item of the proposition, 
formula ( |38a| >, is given in [22], (see Theorem 2.3., Page 598). 

— Proof : recall that Y is the Rie mann ian centre of mass of G(Y , a), if it is a global minimiser of Y i—» £{Y | Y. er). This 
is the same as saying that Y verifies (38aI. Write £{Y) in place of £(Y\ Y,a). Proposition [ 3 ] states that Y is the Riemannian 
centre of mass of G(Y, a) if Y is a stationary point of £ : V m — > R+. 

It is now shown this is true. Precisely, denoting V£ the Riemannian gradient of £, it is shown that 

V£(Y) = 0 (72) 

A well-known expression for V£ is the following, see 

Y£(Y) = - 2 f Vm Log Y (Z)p(Z\Y,a)dv(Z) (73) 

where Log v denotes the Riemannian logarithm mapping, (whose expression is <(40]», given below). 

To prove ( [72] ), note that for all Y £ V m , 

f m P(Z\Y,a)dv(Z) = l (74) 

since p(Z\Y, a), as defined by (20 1 , is a probability density. Taking the Riemannian gradient of both sides, it follows 

ML p{Z\ Y, cr)dv(Z)) = 0 (75) 

where V v means the gradient is with respect to the variable Y. Assume it is possible to carry the Riemannian gradient under 
the integral. The previous identity (75) then becomes 

f S7 Y p(Z\ Y, a)dv(Z) = 0 (76) 


Recall expression (20 1 of p(Z\ Y,a). Computing the gradient, one has, 

S7 Y p(Z\Y,a) = ^x-^xV,.(i 2 (Z,y)x exp - ^^P 

However, the Riemannian gradient of squared Rao’s distance is given by, (see ]62] , Page 407), 

X Y d 2 (Z, Y) = -2Log Y (Z ) 

Which means that ( [77] ) can be written 

S7 Y p(Z\ Y,a) = ^-x Log Y (Z) xp(Z\Y,a) 


(77) 


(78) 
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Replacing this expression in ( |76| > yields, 


f p Log Y (Z) p(Z\ Y, a)dv(Z) = 0 


(79) 


which holds for all Y £ V m - In particular, putting Y = Y, and comparing to (73 i, one immediately obtains ( |72| . This 
completes the proof of part (i) of the proposition, (the justification for carrying the gradient under the integral in (|75[) can be 
made using the dominated convergence theorem). 

In order to maintain a reasonable length, the proof of part (ii) of the proposition will not be detailed. It uses the same 


technique as the proof of part (i). Precisely, to obtain ( |38b[ ), differentiate both sides of < 74 1 with respect to a , and carry the 
derivative under the integral. It is then possible to conclude by a direct calculation. ■ 


Appendix B 

Proof of Proposition[TT] 


This appendix provides the two remaining parts of the proof of Proposition mi Paragraph |III-C| Namely, these two parts 
are the proofs of Lemmas □ and [ 2 ] 

— Proof of Lemma [lj: to begin, let 7 : [0,1] —► V,,, be the geodesic curve connecting Y to Y N . By definition. 


s 7 ( 0 ) = a 


(80) 


Let (e a (f);a = 1,... ,p), be a parallel orthonormal basis along 7 , with e a (0) = e a , where e Q is the basis introduced in (39i. 
For the notion of parallel orthonormal basis, see 1621. Then, it is possible to write, 

p 


VM7(*))=£ V£“(t)e„(i) 


a—1 


Indeed, this means that (V£“(f); a = 1,... ,p) are the components of the vector V£ N ( 7 (f)) in the basis (e a (f); a = 1,... ,p). 
From the Taylor development of the functions V£“(f), 


V£“(l) = V£“ (0) + Y, K£ N (e b , e a )( 7 (0)) A b + R> 


(81) 


6=1 


where R N denotes the remainder. This follows from ( 8 O 1 and from the Taylor development formula in |[62J (Page 83). 

For the left-hand side of (81 1 , note that 7 ( 1 ) = Y N is the global minimum of £ N . In particular, 7 ( 1 ) is a stationary point 
of £ N , so V£ w ( 7 ( 1 )) = 0 and 

Vf“(l)=0 (82) 


For the right-hand side, note that 7 ( 0 ) = Y. Therefore, by (32 1 and ( |78| ) 

N 

N 1 ' 2 x V£ n ( 7 (0)) = N~ 1/2 x-2Y Log r Y n 

n—1 

Furthermore, using (721 and ( [73] ), it follows from the central limit theorem, with C given by (43 1 , 

C{N^ 2 x (V4(0),..., V^(0))} => AA(0, C) 

In the second term on the right-hand side, 

N 

X7 2 £ N (e b ,e a )( 7 (0)) = TV -1 ^ V 2 d 2 (F, Y n ) 

n—1 

So, by the law of large numbers, with H given by ( |46| , 

V 2 fjv(e b ,e o )(7(0)) — >H as N ->• 00 

Finally, the remainder R N can be written, 

p 

R N = YKb^b 


where, by Proposition 
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H'ab verifies the limit 


Kb 


6=1 


0 as N —> 00 


(83) 


(84) 


(85) 


( 86 ) 


(87) 


( 88 ) 


Now, to prove the lemma, it is enough to multiply both sides of (81 1 by N 1/2 and substitute (82 1 , as well as the limits (84 1 , 
and 
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— Proof of Lemma [I]: the aim will be to prove that II = (l/2cr 2 ) x C. Indeed, Lemma [I] states that A = H -1 C H~ x . 
Therefore, H = (l/2cr 2 ) x C implies A = 4cr 4 x C -1 . 

To begin, let C(v,w ) be given by 


C(v,w) = 4 x [{Log Y (Z),v) Y x (Log Y (Z), w) Y ] p(Z\Y,cr)dv(Z) 


(89) 


for any symmetric m x m matrices v and w. Then, it is clear from (43 i, that C a b = C(e a , e-b)- Recall that H is given by (46 1 , 
in terms of the Riemannian Hessian Y 2 £(Y). To show that H = (l/2cr 2 ) x C, it will be enough to prove 

V 2 £(Y)(v, w ) = (1/2ct 2 ) x C(v, w ) (90) 

for all symmetric m x m matrices v and w. 

In order to obtain this equality, it will be convenient to introduce the following notation. For Y £ V,„ and a > 0, let 


p(Z\Y,a) =e^ Ya) 


£(Z\ 1» = — log C(a) - ^ d 2 (Z, Y) 


(91) 


as follows from (20 1 . Since p(Z\ Y,a) is a probability distribution, 

f v y^ Y ^dv(z) = i 

for any Y £ V,„. The above integral, as a function of Y, being constant, its Riemannian Hessian is equal to 0. By carrying 
the Riemannian Hessian under the integral. 


/ V 2 Y e e ^ Y ’ a) (v,w)dv(Z) = 0 


(92) 


for all symmetric m x m matrices v and w, where V;. indicates the Riemannian Hessian is with respect to the variable Y. 
The expression under the integral can be evaluated using the following identity, which holds for the Riemannian Hessian of 
the exponential of any function £, 

V 2 e^ z \ y '^(v,w) = { Vl£(Z\Y, a) (v, w) + (V Y £(Z\ Y, a),v) Y x (V y £(Z\Y, <j),w) y } e £ W Y ’ a) (93) 
where the scalar product notation is that of ( |4T| . 

This identity is shown in the short note following the proof. It is an immediate result of the definition of the Riemannian 
Hessian {62) (Page 3 and Page 41). Using ([91), the following expressions are obtained. 


Vy £(Z\ Y, a) = V v d 2 (Z, Y) = 4 Log Y (Z) 
lo 1 cr z 

V 2 v £(Z\Y,a) = -^ x V 2 d 2 (Z, Y) 

where the second equality in ( |94) follows from ( |78) . Replacing ( |93) , ( |94) and ( |95) into ( |92) gives 

- 2^2 x J rm Vy d 2 (z, y)(«, w)p(Z\ Y, a)dv(Z) 

+ x [(Log Y (Z),v) Y x (Log Y (Z),w) Y ] p(Z\Y,a)dv(Z) = 0 

as follows by using the notation of |9l) . After putting Y = Y, the last identity is exactly the same as, 

— —*— x X7 2 £(Y)(v,w) H- — x C(v,w ) = 0 

2 o' 2 v 1 4 a 4 


(94) 

(95) 


(96) 


which immediately gives (90 1 . This completes the proof of the lemma. 

— Proof of ( |93) : Recall the definition of the Riemannian Hessian of a function /, (see (62) , Page 41), 

V 2 f(Y)(v,w)=(V v Vf(Y),w) Y 


where V v denotes the covariant derivative, (see [621, Page 3), and Vf is the Riemannian gradient of /. If /( Y) = e'(Y), 
then it follows, 

V/(y) = S7l(Y) x e e(Y) 

Then, by the product formula for the covariant derivative, 

v„ V/(y) = V„ V£(F) x e^ Y) + Vf(F) x (V£(Y),v) Y e e ^ 

forming the scalar product with w, this yields 

X7 2 f(Y)(v,w) = \/ 2 £(Y)(v,w) x e e{Y) + (S71 (Y),w) y x (V£(Y),v) y e e{Y) 

which is the same as 493). 








